Ascend 910


AMLA: MUL by ADD in FlashAttention Rescaling

Liao, Qichen, Hu, Chengqiu, Miao, Fangzheng, Li, Bao, Liu, Yiyang, Lyu, Junlong, Jiang, Lirui, Wang, Jun, Zheng, Lingchao, Li, Jun, Fan, Yuwei

arXiv.org Artificial Intelligence

Multi-head Latent Attention (MLA) significantly reduces KVCache memory usage in Large Language Models while introducing substantial computational overhead and intermediate variable expansion. This poses challenges for efficient hardware implementation, especially during the decode phase. This paper introduces Ascend MLA (AMLA), a high-performance kernel specifically optimized for Huawei's Ascend NPUs. AMLA is built on two core innovations: (1) a novel FlashAttention-based algorithm that replaces floating-point multiplications with integer additions for output block rescaling, leveraging the binary correspondence between FP32 and INT32 representations; (2) a Preload Pipeline strategy with hierarchical tiling that maximizes FLOPS utilization: the Preload Pipeline achieves Cube-bound performance, while hierarchical tiling overlaps data movement and computation within the Cube core. Experiments show that on Ascend 910 NPUs (integrated in CloudMatrix384), AMLA achieves up to 614 TFLOPS, reaching 86.8% of the theoretical maximum FLOPS and outperforming the state-of-the-art open-source FlashMLA implementation, whose FLOPS utilization reaches up to 66.7% on the NVIDIA H800 SXM5. The AMLA kernel has been integrated into Huawei's CANN and will be released soon.
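The rescaling trick rests on a simple property of IEEE-754: for a normal FP32 value, the 8-bit exponent field occupies bits 23-30 of the bit pattern, so multiplying by a power of two amounts to adding a constant to the INT32 view of the number. The kernel itself is not reproduced in the abstract; the snippet below is only a minimal sketch of that FP32/INT32 correspondence (the function name and the NumPy formulation are illustrative, and zeros, subnormals, infinities and NaNs are ignored).

```python
import numpy as np

def scale_by_pow2_via_int_add(x: np.ndarray, k: int) -> np.ndarray:
    """Multiply normal FP32 values by 2**k using one INT32 addition per element.

    Adding k << 23 to the INT32 view of a normal float shifts its exponent
    field directly; zeros, subnormals, infinities and NaNs are not handled here.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.int32)
    return (bits + np.int32(k * (1 << 23))).view(np.float32)

x = np.array([1.5, -3.25, 40.0], dtype=np.float32)
print(scale_by_pow2_via_int_add(x, -2))  # [ 0.375  -0.8125  10.  ], i.e. x / 4
```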


Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization

Ruggeri, Giuseppe, Andri, Renzo, Pagliari, Daniele Jahier, Cavigelli, Lukas

arXiv.org Artificial Intelligence

Inference for Deep Recommender Models (DLRMs) is a fundamental AI workload, accounting for more than 79% of the total AI workload in Meta's data centers. DLRMs' performance bottleneck lies in the embedding layers, which perform many random memory accesses to retrieve small embedding vectors from tables of various sizes. We propose the design of tailored data flows to speed up embedding lookups. Specifically, we propose four strategies to look up an embedding table efficiently on one core, and a framework to automatically map the tables asymmetrically to the multiple cores of a SoC. We assess the effectiveness of our method on Huawei Ascend AI accelerators, comparing it with the default Ascend compiler, and we perform high-level comparisons with the Nvidia A100. Results show speed-ups ranging from 1.5x to 6.5x for real workload distributions, and more than 20x for extremely unbalanced distributions. Furthermore, the method proves to be far less sensitive to the query distribution than the baseline.
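The abstract does not spell out the four per-core lookup strategies or the mapping framework; as a rough illustration of the asymmetric table-to-core mapping problem it targets, the sketch below greedily assigns tables to the least-loaded core under a stand-in cost model (all names and the cost estimate are assumptions, not the paper's algorithm).

```python
from heapq import heapify, heappush, heappop

def map_tables_to_cores(tables, num_cores):
    """Greedily assign embedding tables to cores, balancing estimated lookup cost.

    `tables` is a list of (table_id, est_cost) pairs; est_cost could be
    lookups_per_query * embedding_dim (a stand-in cost model).
    Returns {core_id: [table_id, ...]}.
    """
    heap = [(0.0, core) for core in range(num_cores)]  # (accumulated cost, core_id)
    heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    for table_id, cost in sorted(tables, key=lambda t: -t[1]):  # largest tables first
        load, core = heappop(heap)                              # least-loaded core
        assignment[core].append(table_id)
        heappush(heap, (load + cost, core))
    return assignment

# Six tables with skewed costs mapped onto two cores.
print(map_tables_to_cores(
    [("t0", 90), ("t1", 40), ("t2", 30), ("t3", 20), ("t4", 10), ("t5", 10)],
    num_cores=2))
```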


Serving Large Language Models on Huawei CloudMatrix384

Zuo, Pengfei, Lin, Huimin, Deng, Junbo, Zou, Nan, Yang, Xingkun, Diao, Yingyu, Gao, Weifeng, Xu, Ke, Chen, Zhangyu, Lu, Shirui, Qiu, Zhao, Li, Peiyang, Chang, Xianyu, Yu, Zhengzhong, Miao, Fangzheng, Zheng, Jia, Li, Ying, Feng, Yuan, Wang, Bei, Zong, Zaijian, Zhou, Mosong, Zhou, Wenli, Chen, Houjiang, Liao, Xingyu, Li, Yipeng, Zhang, Wenxiao, Zhu, Ping, Wang, Yinggang, Xiao, Chuanjie, Liang, Depeng, Cao, Dong, Liu, Juncheng, Yang, Yongqiang, Bai, Xiaolong, Li, Yi, Xie, Huaguo, Wu, Huatao, Yu, Zhibin, Chen, Lv, Liu, Hu, Ding, Yujun, Zhu, Haipei, Xia, Jing, Xiong, Yi, Yu, Zhou, Liao, Heng

arXiv.org Artificial Intelligence

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
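The abstract lists INT8 quantization among the hardware-aware optimizations without detailing the scheme; the snippet below shows a generic symmetric per-tensor INT8 quantize/dequantize round trip as a point of reference (an assumed textbook recipe, not necessarily the one used in CloudMatrix-Infer).

```python
import numpy as np

def int8_quantize(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w ~= scale * q with q in [-127, 127]."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to FP32."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = int8_quantize(w)
print("max abs quantization error:", np.max(np.abs(w - int8_dequantize(q, s))))
```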


Amid US restrictions, research points to new opportunities for China's most powerful AI chip

#artificialintelligence

Huawei Technologies' Ascend chip, China's most powerful artificial intelligence (AI) processor, can outperform Nvidia's flagship V100 chip in certain tasks but also has some serious shortcomings, according to a new study by Chinese scientists. The researchers evaluated the Ascend processor's performance across a range of applications to provide the first in-depth look at China's growing competence, as well as its weaknesses, in AI chip technology. The researchers said that, although it does not fully match international flagship chips in overall performance, the Huawei Ascend processor could be used in most existing applications and in some scenarios even surpass the performance of global competitors. The evaluation, carried out by researchers at China's Renmin University and Tsinghua University, was published in the peer-reviewed Chinese Journal of Computers in August, just before Washington banned US sales of the most powerful AI chips to China. Graphics processing units (GPUs) were originally developed to render images in video games, but over the past decade they have been increasingly deployed in the largest supercomputers by scientists and internet companies.


A Brand New Chip Design Will Drive AI Development (Analytics Insight)

#artificialintelligence

The world is now heading into the Fourth Industrial Revolution, as Professor Klaus Schwab, Founder and Executive Chairman of the World Economic Forum, described it in 2016. Artificial Intelligence (AI) is a key driver of this revolution, and machine learning is central to it. Critical to the whole process, however, is the need to process tremendous amounts of data, which in turn boosts the demand for computing power exponentially. A study by OpenAI suggested that the computing power required for AI training surged by more than 300,000 times between 2012 and 2018. This represents a doubling of computing power every three months and two weeks, a pace significantly faster than the doubling cadence traditionally described by Moore's Law.


Huawei and Peng Cheng Laboratory Plan to Build 1000 PFLOPS Cloud Brain II AI Research Platform

#artificialintelligence

Huawei and Peng Cheng Laboratory (PCL) have jointly released Peng Cheng Cloud Brain II Phase 1, officially launching the journey toward AI clusters at the 1000 petaFLOPS (PFLOPS) scale. This marks a new milestone for the Kunpeng computing industry in the scientific research field. At the foundation of Cloud Brain II is the Huawei Atlas 900 AI cluster, powered by the Huawei Kunpeng and Ascend processors. The computing power of Peng Cheng Cloud Brain is currently 100 PFLOPS and is planned to scale to 1000 PFLOPS and beyond next year. "This September, Huawei embarked on the Kunpeng Ascend dual-engine computing strategy. Inspired by this strategy, we are committed to providing the ultimate computing power to the world. We also released Atlas 900, the world's fastest AI training cluster," said Hou Jinlong, Senior VP of Huawei and President of Huawei Cloud & AI Products and Services.


Top 11 Hot Chips For Machine Learning

#artificialintelligence

Though machine learning has been around for more than three decades, it took a long time for hardware to catch up with the demands of these power-hungry algorithms. With each passing year, chipset manufacturers have tried to make the hardware lighter and faster. Today, over 100 companies are working on next-generation chips and hardware architectures that can match the capabilities of these algorithms. Such chips are capable of enabling deep learning applications on smartphones and other edge computing devices. Intel recently revealed new details of its upcoming high-performance artificial intelligence accelerators: the Intel Nervana neural network processors.


Huawei Launches AI Ecosystem Program in Europe, with 100M Euros Investment in 5 Years

#artificialintelligence

This program opens a new chapter for the computing industry in Europe. According to Jiang Tao, VP of the Intelligent Computing BU, "Huawei is committed to investing in the AI computing industry in Europe, enabling enterprises and individual developers to leverage the Ascend AI series products for technological and business innovation. Over the next 5 years, Huawei plans to invest 100 million euros in the AI Ecosystem Program in Europe, helping industry organizations, 200,000 developers, 500 ISV partners, and 50 universities and research institutes to boost innovation." First, Huawei will work with partners to shape the AI industry in Europe. Second, Huawei will develop joint solutions with ISV partners.



World's Most Powerful AI Processor Meets the Market? Here Is How to Buy It

#artificialintelligence

HUAWEI CONNECT is one of Huawei's annual corporate-level flagship events and has witnessed the release of many of the company's crucial strategies and products. At HUAWEI CONNECT 2018, Huawei released its AI strategy and AI processors. Since Huawei released the Ascend 910, the world's most powerful AI processor, on August 23, its business model has attracted considerable attention from the industry. In response to media questions, Eric Xu, Huawei's Deputy Chairman of the Board and Rotating Chairman, said that the Ascend 910-based training service could be launched in the Chinese market in September this year and globally in the first quarter of next year.